Introduction to Open Data Science - Course Project

Chapter 1

Here I am writing a short description of the course (or perhaps more like a few thoughts about it). I include a link to my GitHub repository here as well. This is an R Markdown (.Rmd) file, so I’ll be using R Markdown syntax.

The R chunk below prints out the date and time at which this Markdown document was knitted, for no particular reason.

# This is a so-called "R chunk" where I can write my R code.

date()
## [1] "Sat Dec  3 22:35:17 2022"

Okay, so I’m feeling great, and on this course I expect to learn all kinds of interesting and useful things about GitHub, Markdown and reproducible research in general. I learned about the course from Sisu when looking for useful and interesting courses among the ones listed as suitable for my PhD degree (that is, there is a list of courses from which one needs to choose courses worth at least 10 study points in order to complete one’s degree).

I was already pretty familiar with the material covered in the ‘R for Health Data Science’ book and in ‘Exercise Set 1’, since I have a strong background in both statistics/data science and R. However, I’m motivated to learn as many new tools as possible to make my research as open and reproducible as possible. I also wish to streamline my research workflows through more fluent integration of coding, writing and publishing via GitHub, Markdown and RStudio.

Last but not least, here is a link to my GitHub repository used during the course: My Github Repo


Chapter 2

students2014 <- read.csv("./data/learning2014.csv")
str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

Above we have loaded the dataset used for this week’s assignment and printed a short summary of it. The data consists of:
- 2 variables of type ‘int’ (age, points),
- 4 variables of type ‘num’ (attitude, deep, stra, surf), and
- 1 variable of type ‘character’ (gender).

Also, there are 166 observations per variable. The data are results from a survey with originally 183 respondents: the respondents were course participants, and those with zero points from the course exam were removed from the data. The variables deep, stra and surf are aggregate variables constructed from multiple responses. More information on the dataset is available here.

Below are marginal distributions for every variable in the data, except for gender, which is a binary variable for which 66% of the respondents answered F and 34% answered M.

While marginal distributions are useful for assessing the shape of each variable’s distribution, we may use pairs to draw scatter plots summarizing the pairwise dependencies between variables.
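Both kinds of plots can be drawn with base R graphics; a minimal sketch, with a small simulated frame (hypothetical values, subset of columns) standing in for students2014:

```r
# Simulated stand-in for students2014 (hypothetical values)
set.seed(1)
demo <- data.frame(age = sample(20:55, 50, replace = TRUE),
                   attitude = runif(50, 1, 5),
                   points = sample(7:33, 50, replace = TRUE))

# Marginal distributions as a histogram grid
par(mfrow = c(1, 3))
for (i in 1:ncol(demo)) hist(demo[, i], main = colnames(demo)[i], xlab = "")
par(mfrow = c(1, 1))

# Pairwise scatter plots
pairs(demo, pch = 20)
```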

Based on a quick visual assessment, there might be at least some positive correlation between points and attitude. Many of the dependencies are, however, a little difficult to read from the pairwise scatter plots alone. To this end we may also inspect the correlation matrix of the data (below).
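A matrix like the one below can be produced along these lines; a minimal sketch, with a simulated frame standing in for students2014 (the gender_binary coding is an assumption matching the description given with the matrix):

```r
# Code gender to a 0/1 indicator and round the correlation matrix
# (simulated stand-in data for students2014)
set.seed(1)
demo <- data.frame(gender = sample(c("F", "M"), 100, replace = TRUE),
                   attitude = runif(100, 1, 5),
                   points = sample(7:33, 100, replace = TRUE))
demo$gender_binary <- ifelse(demo$gender == "F", 1, 0)
round(cor(demo[, -1]), 2)  # drop the character column 'gender' before cor()
```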

##                 age attitude  deep  stra  surf points gender_binary
## age            1.00     0.02  0.03  0.10 -0.14  -0.09         -0.12
## attitude       0.02     1.00  0.11  0.06 -0.18   0.44         -0.29
## deep           0.03     0.11  1.00  0.10 -0.32  -0.01         -0.06
## stra           0.10     0.06  0.10  1.00 -0.16   0.15          0.15
## surf          -0.14    -0.18 -0.32 -0.16  1.00  -0.14          0.11
## points        -0.09     0.44 -0.01  0.15 -0.14   1.00         -0.09
## gender_binary -0.12    -0.29 -0.06  0.15  0.11  -0.09          1.00

Above we have coded the variable gender into a binary variable gender_binary, which equals one if gender = F and zero otherwise. The correlation between points and attitude is the strongest linear dependency, as our visual assessment already suggested.

We shall take the three variables with the largest absolute correlations with points as our starting point for the analysis. That is, we estimate a linear regression model with points as the dependent variable and three explanatory variables (attitude, stra and surf). The results of our regression model are summarized below (standard errors are not heteroskedasticity robust).

model1 <- lm(points ~ attitude + stra + surf, data = students2014)
summary(model1)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The t-test statistics reported above suggest that the coefficients for stra and surf are not statistically significantly different from zero, as already suggested by the low absolute correlations between them and points. To elaborate, the t-test tests the null hypothesis coefficient = 0 for every variable in the model, and only the coefficient on attitude was found to be statistically significant (and clearly so!).

To that end, we drop both stra and surf from our regression and run the final model with attitude as the only explanatory variable.

model2 <- lm(points ~ attitude, data = students2014)
summary(model2)
## 
## Call:
## lm(formula = points ~ attitude, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The intercept of the final model is estimated to be 11.64, which can be interpreted as the predicted average exam points for someone with attitude = 0. The more interesting coefficient on attitude is estimated to be 3.53, which suggests that, on average, every additional point in attitude is associated with a 3.53-point rise in exam points. The multiple R-squared of the model is around 19%, meaning the lone explanatory variable explains that much of the variation in the dependent variable points.
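As a quick worked check of these interpretations, plugging in the coefficient estimates from the summary above:

```r
# Worked check of the interpretation, using the estimates from the summary
b0 <- 11.6372  # intercept
b1 <- 3.5255   # coefficient on attitude
b0 + b1 * 3                    # predicted points at attitude = 3: ~22.2
(b0 + b1 * 4) - (b0 + b1 * 3)  # effect of one extra attitude point: 3.5255
```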

Next, we shall take a look at some visual model diagnostics and briefly discuss the assumptions of our model. Let’s see the visual diagnostics first.
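The diagnostic plots can be produced along these lines; a minimal sketch, with a simulated model standing in for the fitted model2 (which requires the course data):

```r
# Simulated stand-in for the fitted model
set.seed(1)
d <- data.frame(attitude = runif(100, 1, 5))
d$points <- 11.6 + 3.5 * d$attitude + rnorm(100, sd = 5)
m <- lm(points ~ attitude, data = d)

# Residuals vs Fitted, Normal Q-Q, Residuals vs Leverage + residual histogram
par(mfrow = c(2, 2))
plot(m, which = c(1, 2, 5))
hist(residuals(m), main = "Histogram of residuals", xlab = "")
par(mfrow = c(1, 1))
```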

Above we have plotted Residuals vs Fitted values, a Normal Q-Q plot, Residuals vs Leverage, points against attitude with our fitted regression line drawn over the scatter plot, and finally a histogram of the residuals. First, as both the Normal Q-Q plot and the histogram of residuals suggest, the residuals of the model are slightly skewed, the left tail being thicker than normal and the right tail thinner. That is, the residuals do not look normally distributed. Moreover, there do not seem to be any wild outliers in the data, nor are the results dominated by particular observations, as suggested by the relatively low maximum leverage. The residuals also seem quite homoskedastic (although this is debatable, and heteroskedasticity-robust standard errors should probably be used just in case).

Fortunately, the linear regression model does not require us to assume the residuals to be normally distributed. Technically, it only needs to be assumed that the residuals have well-defined fourth moments (and, in this case, are homoskedastic); our coefficient estimates and the estimates of their standard deviations will then be asymptotically correct. Also, we may safely assume the residuals in survey-based data like this to be uncorrelated with each other. The dependence between points and attitude also seems fairly linear, as required by our model assumptions.

Thus, our model assumptions seem relatively well justified.


Chapter 3

alc <- read.csv("./data/alc.csv")
str(alc)
## 'data.frame':    370 obs. of  35 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  15 15 15 15 15 15 15 15 15 15 ...
##  $ address   : chr  "R" "R" "R" "R" ...
##  $ famsize   : chr  "GT3" "GT3" "GT3" "GT3" ...
##  $ Pstatus   : chr  "T" "T" "T" "T" ...
##  $ Medu      : int  1 1 2 2 3 3 3 2 3 3 ...
##  $ Fedu      : int  1 1 2 4 3 4 4 2 1 3 ...
##  $ Mjob      : chr  "at_home" "other" "at_home" "services" ...
##  $ Fjob      : chr  "other" "other" "other" "health" ...
##  $ reason    : chr  "home" "reputation" "reputation" "course" ...
##  $ guardian  : chr  "mother" "mother" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 2 1 2 2 2 1 ...
##  $ studytime : int  4 2 1 3 3 3 3 2 4 4 ...
##  $ schoolsup : chr  "yes" "yes" "yes" "yes" ...
##  $ famsup    : chr  "yes" "yes" "yes" "yes" ...
##  $ activities: chr  "yes" "no" "yes" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "yes" "yes" "no" "yes" ...
##  $ romantic  : chr  "no" "yes" "no" "no" ...
##  $ famrel    : int  3 3 4 4 4 4 4 4 4 4 ...
##  $ freetime  : int  1 3 3 3 2 3 2 1 4 3 ...
##  $ goout     : int  2 4 1 2 1 2 2 3 2 3 ...
##  $ Dalc      : int  1 2 1 1 2 1 2 1 2 1 ...
##  $ Walc      : int  1 4 1 1 3 1 2 3 3 1 ...
##  $ health    : int  1 5 2 5 3 5 5 4 3 4 ...
##  $ failures  : num  0.5 1 0 0 1 0 1 0 0 0 ...
##  $ paid      : chr  "yes" "no" "yes" "yes" ...
##  $ absences  : num  3 2 8 2 5 2 0 1 9 10 ...
##  $ G1        : num  10 10.5 14 10 11.5 11.5 11 9.5 15.5 10 ...
##  $ G2        : num  11.5 8.5 13 10 11.5 12 5.5 9.5 15.5 10.5 ...
##  $ G3        : num  11.5 8 12.5 9 11.5 11.5 6 9.5 15.5 10.5 ...
##  $ alc_use   : num  1 3 1 1 2.5 1 2 2 2.5 1 ...
##  $ high_use  : logi  FALSE TRUE FALSE FALSE TRUE FALSE ...

Above we have printed some information on the data used in this week’s assignment, including variable names and the data types of the variables. In total we have 370 observations of 35 variables. The data are student-level results from a questionnaire in two Portuguese secondary schools, and the variables measure student achievement as well as demographic and social features. Based on these data, we shall study the relationship between alcohol consumption and some of the variables.

Of the variables other than those related to alcohol consumption, we shall focus on higher, romantic, famrel and G3 (final grade). The working hypothesis is that the variables measuring social aspects (romantic and famrel) could be negatively correlated with alcohol consumption, and that the variables related to ambitions and achievement (higher and G3) could likewise be negatively associated with it, since excessive alcohol use could make it more difficult to succeed in school.

For the analysis we need to encode romantic and higher (both take character values yes or no) as binary variables taking values one and zero.

alc$romantic_binary <- ifelse(alc$romantic == "yes", 1, 0)
alc$higher_binary <- ifelse(alc$higher == "yes", 1, 0)

Below we see the marginal distributions of the four explanatory variables for both high_use = TRUE and high_use = FALSE. The variable high_use takes the value TRUE if weekly alcohol consumption exceeds a certain threshold. For 30% of the students in the sample, high_use = TRUE.
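Such conditional distributions can be drawn, for example, as boxplots split by high_use; a minimal sketch, with a simulated stand-in for alc (variable names as in the data, values hypothetical):

```r
# Simulated stand-in for alc
set.seed(1)
alc_demo <- data.frame(
  high_use = sample(c(TRUE, FALSE), 100, replace = TRUE, prob = c(0.3, 0.7)),
  G3       = rnorm(100, mean = 11, sd = 3),
  famrel   = sample(1:5, 100, replace = TRUE)
)

# Distributions of G3 and famrel conditional on high_use
par(mfrow = c(1, 2))
boxplot(G3 ~ high_use, data = alc_demo, main = "G3 by high_use")
boxplot(famrel ~ high_use, data = alc_demo, main = "famrel by high_use")
par(mfrow = c(1, 1))
```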

The conditional marginal distributions of the variables provide preliminary support for all of our hypotheses. First, higher is a binary variable taking the value zero only if the student is not considering pursuing higher education, and thus reflects educational ambitions. As suspected, among those with lower educational ambitions, high alcohol consumption is much more common. The difference might not look huge in the plot, since higher = 0 holds for only 4% of the students, but among those (16) students high alcohol consumption was approximately twice as probable as in the rest of the data! (see table below)
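A cross tabulation like the one below can be produced with table(); a minimal sketch, with a simulated stand-in for alc (names as in the data, values hypothetical):

```r
# Simulated stand-in for alc
set.seed(1)
alc_demo <- data.frame(
  high_use      = sample(c(TRUE, FALSE), 100, replace = TRUE, prob = c(0.3, 0.7)),
  higher_binary = rbinom(100, 1, 0.96)
)

# Cross tabulation of high alcohol use vs higher-education plans
table(high_use = alc_demo$high_use, higher = alc_demo$higher_binary)
```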

##                   
##                    higher = 0 higher = 1
##   high_use = FALSE          7        252
##   high_use = TRUE           9        102

As for the next variable, whether a student is in a romantic relationship does not seem to have much effect on their alcohol consumption. However, high_use = TRUE seems to be slightly more common for romantic = “no”, as hypothesized. Good family relations (higher values of famrel), on the other hand, seem to be clearly associated with less alcohol consumption, also as hypothesized. Finally, perhaps the clearest negative association (by visual inspection at least) is between high alcohol consumption and the final grade, or educational success, which makes perfect sense and is also as we suspected.

Next, we shall estimate a logistic regression model using the four just discussed variables as explanatory variables and high_use as the dependent variable. Below we have printed out the summary of our results.

# Logistic regression model
model <- glm(high_use ~ higher_binary + romantic_binary + famrel + G3, 
             data = alc, family = binomial(link = "logit"))

# A summary of the estimated model
summary(model)
## 
## Call:
## glm(formula = high_use ~ higher_binary + romantic_binary + famrel + 
##     G3, family = binomial(link = "logit"), data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4342  -0.8506  -0.7251   1.3382   1.9521  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)   
## (Intercept)      2.06805    0.77766   2.659  0.00783 **
## higher_binary   -0.90841    0.55444  -1.638  0.10133   
## romantic_binary -0.33272    0.25738  -1.293  0.19611   
## famrel          -0.28414    0.12473  -2.278  0.02272 * 
## G3              -0.07423    0.03716  -1.997  0.04578 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 436.57  on 365  degrees of freedom
## AIC: 446.57
## 
## Number of Fisher Scoring iterations: 4

# Odds ratios
odds_ratios <- exp(model$coefficients)

# Confidence intervals
ci <- confint(model)

# Printing out both odds ratios and their confidence intervals
print(cbind(odds_ratios, exp(ci)))
##                 odds_ratios     2.5 %     97.5 %
## (Intercept)       7.9093629 1.7560948 37.6713039
## higher_binary     0.4031642 0.1315342  1.1948265
## romantic_binary   0.7169700 0.4283564  1.1778900
## famrel            0.7526588 0.5885479  0.9614471
## G3                0.9284590 0.8625985  0.9983319

The estimated coefficients of the logistic regression model are all negative (equivalently, the point estimates of the odds ratios are below one), and thus consistent with our hypothesis that all the included explanatory variables are negatively associated with high alcohol consumption. The results are, however, not very strong. The negative association between alcohol consumption and the first two explanatory variables (romantic and higher) is not statistically significant at any commonly used significance level. This comes as no surprise after the inspection of the conditional marginal distributions of those variables, where the difference between high_use = TRUE and high_use = FALSE was quite small. Also, as discussed, there were only 16 students for whom higher_binary = 0. Hence, although the point estimate for the coefficient of higher was actually the largest in absolute value, more data would be needed to obtain statistically significant results.

On the other hand, the coefficients on famrel and G3 were found to be statistically significant at the 95% level, albeit small-ish in absolute value. Based on these findings, we could say the data and the model give some support to our hypothesis that both healthy social relations and educational success are negatively associated with high alcohol consumption.

However, as discussed, the odds ratios for the statistically significant explanatory variables were not very different from one (little predictive power), and the 95% confidence intervals for both almost include one. To make things worse, the point estimates for the statistically non-significant coefficients were the largest in absolute value, so it is quite plausible that some over-fitting is going on. The predictive performance of the model (based on those point estimates) might then not be all that amazing. Let us assess this by first printing out a 2x2 cross tabulation of predictions versus the observed values (see below). As a decision boundary we use the value of 50%.
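The classification step itself is just the fitted probabilities thresholded at 50%; a minimal sketch of the mechanics, with a simulated model and data standing in for the ones above:

```r
# Simulated stand-in for the logistic model and data
set.seed(1)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(-1 + 0.5 * d$x))
m <- glm(y ~ x, data = d, family = binomial)

prob <- predict(m, type = "response")  # fitted probabilities
prediction <- ifelse(prob > 0.5, 1, 0) # 50% decision boundary
table(y = d$y, prediction = prediction)
```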

##         prediction
## high_use   0   1
##        0 249  10
##        1 105   6

Clearly, the model classifies almost all of the students as high_use = 0, including most of those with high_use = 1. This is no surprise, since for 70% of the students high_use = 0 indeed holds; thus, by classifying every student as 0, one would already obtain a respectable (not really) accuracy rate of 70%.

So does our model perform any better than such a simple guessing strategy? Unfortunately it turns out not to be the case… The accuracy rate of our model is only 68.92% (31.08% training error) against the 70% (30% training error) of the simple guessing strategy.
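The accuracy figures follow directly from the cross tabulation; for instance, with the counts from the table above:

```r
# Confusion matrix from above: rows = actual high_use (0/1), cols = prediction
cm <- matrix(c(249, 105,   # predicted 0
                10,   6),  # predicted 1
             nrow = 2)
accuracy <- sum(diag(cm)) / sum(cm)
round(100 * accuracy, 2)        # 68.92 (accuracy)
round(100 * (1 - accuracy), 2)  # 31.08 (training error)
```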

To conclude, the predictive performance of our model is poor, reflecting the weakness of the results. The data and the model do, however, give some support to our hypothesis that both healthy social relations and educational success are negatively associated with high alcohol consumption.


Chapter 4

library(MASS)
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Above we have loaded a dataset called Boston from the MASS package. The dataset contains crime rates for various towns in the Boston area, along with a number of other town characteristics. In total there are 506 observations (towns) and 14 variables.

Below are the marginal distributions as well as pair plots of all the variables in the data.

par(mfrow = c(3, 5))
par(mar = c(3, 4, 2, 1))
for(i in 1:ncol(Boston)) hist(Boston[,i], main = colnames(Boston)[i], xlab = "", col = "peachpuff")
par(mfrow = c(1, 1))

Clearly there are a lot of irregularities in the distributions of the variables. Specifically, many of the marginal distributions are seriously skewed or even multimodal. Hence, no methods assuming normality or near-normality of the data should be employed here. Let us next draw a correlation plot to assess the linear dependencies between the variables.

library(corrplot)
cor_matrix <- cor(Boston)
corrplot(cor_matrix, method = "circle", type = "upper")

There are multiple strong correlations, both positive and negative, between many of the variables, suggesting the variables do indeed have some explanatory power over each other. Below we print out the five largest positive and negative correlations in the data.

print_largest_correlations <- function(data, how_many = 5, negative = FALSE, digits = 2) {
  cm <- cor(data); diag(cm) <- NA; n <- ncol(cm)
  if(how_many > n * (n - 1) / 2) how_many <- n * (n - 1) / 2
  cm_order <- order(cm[upper.tri(cm)])
  largest <- ifelse(rep(negative, how_many), cm_order[1:how_many], rev(cm_order)[1:how_many])
  strings <- c(); cors <- c()
  count <- 1
  for(j in 1:n) {
    for(i in 1:n) {
      if(i >= j) next
      if(count %in% largest) {
        cors <- c(cors, cor(data[,i], data[,j]))
        strings <- c(strings, paste0("cor(", colnames(data)[i], ", ", colnames(data)[j], ") = ", 
                   round(cors[length(cors)], digits), "\n"))
      }
      count <- count + 1
    }
  }
  strings <- strings[order(cors, decreasing = !negative)]
  if(!negative) cat("Largest positive correlations: \n") else cat("Largest negative correlations: \n")
  for(i in 1:length(strings)) cat(strings[i])
  cat("\n")
}
print_largest_correlations(Boston)
## Largest positive correlations: 
## cor(rad, tax) = 0.91
## cor(indus, nox) = 0.76
## cor(nox, age) = 0.73
## cor(indus, tax) = 0.72
## cor(rm, medv) = 0.7
print_largest_correlations(Boston, negative = TRUE)
## Largest negative correlations: 
## cor(nox, dis) = -0.77
## cor(age, dis) = -0.75
## cor(lstat, medv) = -0.74
## cor(indus, dis) = -0.71
## cor(rm, lstat) = -0.61

So we have established that there are some linear dependencies between the variables in the data, but what about non-linear dependencies? We may try to assess these visually by means of a pair plot (plotted below).

pairs(Boston, gap = 0.5, oma = c(0, 0, 0, 0), pch = 20)

Visual assessment doesn’t immediately reveal any strong dependencies beyond those linear enough to result in correlations with high absolute values. However, some dependencies for which the correlations seemed high exhibit significant non-linearities (see e.g. the relationship between nox and dis), and naive linear models might not be the best approach to modeling the dependencies in this data.

Before diving into model fitting and further analysis, we shall prepare our data. First we scale the data to have zero mean and unit variance.

boston_scaled <- data.frame(scale(Boston))
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

We shall also replace the numerical variable crim (crime rate) with a categorical counterpart with four levels (“low”, “med_low”, “med_high”, “high”), using the quartiles as break points.

bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, 
             labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- crime

We then divide the data into train and test sets with a random 80/20 split.

# Random 80/20 split:
# (floor(n * 0.8) returns an integer -> less ambiguous than n * 0.8, also
#  'sample.int' slightly more efficient and less ambiguous than 'sample')
n <- nrow(boston_scaled)
ind <- sample.int(n,  size = floor(n * 0.8)) 
train <- data.frame(boston_scaled[ind,], stringsAsFactors = TRUE)
test <- data.frame(boston_scaled[-ind,], stringsAsFactors = TRUE)

# Save the correct classes from test data (just in case)...
correct_classes <- test$crim

# ...after which remove the crime variable from test data:
test$crim <- NULL

Then, although we have established the non-normality of the variables, we shall use a method that assumes the data to follow a multivariate normal distribution, because why not! Linear discriminant analysis (LDA) it is.

# LDA model
lda.fit <- MASS::lda(crim ~ ., data = train)

# The function for LDA biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# Target classes as numeric
classes <- as.numeric(train$crim)

# Plot the LDA biplot
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)

lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       14      12        2    0
##   med_low    4      20        2    0
##   med_high   0       6       11    1
##   high       0       0        0   30

Above we have plotted our results by means of an LDA biplot and then printed out the cross tabulation of the predictions made by the model on the test data. Evidently, regardless of the violation of the LDA assumptions, LDA seems able to differentiate between the classes, and the test error rate seems fairly low. The model has the most difficulty differentiating between the categories “med_low” and “med_high”, which is to be expected. However, all of the “high” observations are classified correctly, and only two of the “low” observations are misclassified by more than one category, which I’d say is fairly well done.
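The test error can be computed directly from the cross tabulation; for instance, with the counts copied from the table above:

```r
# Confusion matrix: rows = correct class, cols = predicted class
cm <- matrix(c(14, 12,  2,  0,
                4, 20,  2,  0,
                0,  6, 11,  1,
                0,  0,  0, 30), nrow = 4, byrow = TRUE)
accuracy <- sum(diag(cm)) / sum(cm)
accuracy  # 75 / 102, about 0.74 -> test error rate about 0.26
```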

However, perhaps there are other methods that could perform even better? To that end, let us compare the results attained with LDA to those attained with the k-means algorithm. For one, k-means does not assume the variables to follow a multivariate normal distribution, so perhaps it performs better than LDA? Let us first reload and scale the data, take a quick look at the distances between observations, and then run the k-means algorithm, first with four clusters.

# Reload the original data
library(MASS)
data("Boston")

# Scale the data
boston_scaled <- data.frame(scale(Boston))

# Calculate the (Euclidean) distances between observations
# (note: computed on the unscaled data here, as are the k-means runs below)
distances <- dist(Boston)
summary(distances)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.119  85.624 170.539 226.315 371.950 626.047

# Set random seed
set.seed(42)

# Run the k-means algorithm
km <- kmeans(Boston, centers = 4)

# Visualize the results with a pairplot
pairs(Boston, col = km$cluster, gap = 0.5, oma = c(0, 0, 0, 0), pch = 20)

For some variable pairs the four clusters seem to do a great job of differentiating the data (as depicted by the pairplot above), whereas for others the clusters are less clear. To that end, let us assess how the total within-cluster sum of squares (TWCSS) behaves as a function of the number of clusters.

# Set random seed
set.seed(42)

# Set the max number of clusters
k_max <- 10

# Calculate the total within sum of squares for every number of clusters
TWCSS <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})

# Visualize the results
plot(x = 1:k_max, y = TWCSS, type = "l", xlab = "Number of Clusters", lwd = 2)
grid()

The most drastic drop in TWCSS happens when moving from one to two clusters. After that, the decrease in TWCSS is still significant up to four clusters, after which any gains are more or less marginal. The four clusters used originally might then be pretty close to the optimal number.

Although we didn’t really get to compare LDA and the k-means algorithm directly in the end, both approaches seemed to fit the data quite well, and interesting insights into the data could most probably be drawn with either method.


Chapter 5

human <- readRDS("data/human.rds")
# Pairplot
pairs(human, gap = 0.5, oma = c(0, 0, 0, 0), pch = 20, xaxt = "n", yaxt = "n")

# Correlation plot
corrplot(cor(human), type = "upper")

# Histograms of marginal distributions
par(mfrow = c(3, 3))
par(mar = c(2, 3.5, 3, 1))
for(i in 1:ncol(human)) {
  hist(human[,i], col = "white", main = colnames(human)[i], xlab = "", ylab = "")
}
par(mfrow = c(1, 1))

This week we will be working with the ‘human’ dataset, originating from the United Nations Development Programme. Above we have produced a visual summary of our data, including a pairplot, a correlation plot and histograms of the marginal distributions of the variables. Evidently, there are plenty of strong dependencies between the variables, some less linear than others. The linear dependencies are well captured by the correlation plot, whereas the less linear dependencies are made clearer by the pairplot. For instance, according to the pairplot, perhaps the strongest pairwise dependency is between GNI and Mat.Mor. While they are indeed negatively correlated, the correlation plot alone would give a false impression of only a relatively modest dependency between the variables.

As all the variables have positive support, most of the marginal distributions are also quite skewed, as expected. Only the marginal distribution of Edu.Exp could perhaps be approximated with a normal distribution with acceptable accuracy. Non-linear dependencies and non-normality of marginal distributions are indeed often related phenomena.

We shall begin our analysis of the data by performing a principal component analysis (PCA) with unnormalized data (not very smart, as we will see!).

# Principal component analysis (PCA)
pca_human <- prcomp(human)
s <- summary(pca_human)

# Rounded percentages of variance captured by each component
pca_pr <- round(100*s$importance[2, ], digits = 0)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# Draw a biplot
biplot(pca_human, cex = c(0.6, 0.8), col = c("grey40", "tomato"), 
       xlab = pc_lab[1], ylab = pc_lab[2])

Evidently, the unnormalized data does not allow for meaningful analysis, as the first principal component alone explains 99.99% of the variance in the data, as also illustrated by the non-informative biplot above. To this end, we shall normalize the data (to zero mean and unit variance for all variables) and repeat the principal component analysis.

# Normalize the data
human_std <- scale(human)

# Principal component analysis (PCA)
pca_human <- prcomp(human_std)
s <- summary(pca_human)

# The variance in the data captured as a function of the number of components
plot(1:ncol(s$importance), s$importance[3, ] * 100, ylim = c(0, 100),
     ylab = "Variance Explained, %",
     xlab = "Number of Principal Components", type = "l", lwd = 2)
grid()

# Rounded percentages of variance captured by each component
pca_pr <- round(100*s$importance[2,], digits = 0)
pc_lab1 <- paste0(names(pca_pr)[1], 
                 c(" - Quality of Life"),
                 " (", pca_pr[1], "%)")
pc_lab2 <- paste0(names(pca_pr)[2], 
                 c(" - Gender Equality"),
                 " (", pca_pr[2], "%)")

# Draw a biplot
biplot(pca_human, cex = c(0.6, 0.8), col = c("grey40", "tomato"), 
       xlab = pc_lab1, ylab = pc_lab2, xlim = c(-0.2, 0.2))

After normalizing the data our PCA looks much better! This is to be expected, as without normalization the contribution of each variable to the analysis is dominated by its scale. For instance, in our unnormalized data the variance of GNI was orders of magnitude larger than that of the other variables, rendering all the other variables practically irrelevant.
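This scale dominance is easy to demonstrate with a small synthetic sketch (toy data, not the course dataset): with two independent variables on wildly different scales, unscaled PCA attributes essentially all the variance to the first component, whereas scaling splits it roughly evenly.

```r
# Toy data (NOT the course dataset): two independent variables whose scales
# differ by three orders of magnitude
set.seed(1)
toy <- data.frame(
  big   = rnorm(100, mean = 0, sd = 1000),  # GNI-like large-scale variable
  small = rnorm(100, mean = 0, sd = 1)
)

# Unscaled PCA: the large-scale variable dominates PC1 almost completely
summary(prcomp(toy))$importance[2, 1]                 # > 0.99

# Scaled PCA: the variance splits roughly evenly between the two components
summary(prcomp(toy, scale. = TRUE))$importance[2, 1]  # close to 0.5
```

Note that `prcomp(x, scale. = TRUE)` is equivalent to running `prcomp` on `scale(x)`, as we do below with the actual data.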

Now the first two principal components explain a reasonable 54% and 16% of the variability in the data, respectively. Inspection of the biplot (or alternatively of the rotation matrix) gives us some further insight into the results.

The first component seems to capture aspects coarsely related to the quality of life (for instance health- and economy-related factors): the two education variables and life expectancy are negatively correlated with the first component, whereas adolescent birth rate and maternal mortality ratio are positively correlated with it. This component alone explains a whopping 54% of the variability in the data!

The second component seems to capture gender-equality-related phenomena, as it correlates mostly with the female labour participation ratio and the percentage of female representatives in parliament. This is another 16% of the variability in the data explained!

We are now done with this dataset and shall turn our attention to tea (yes, the drink) next. To that end, we load the results of a questionnaire of 300 participants, who were asked all kinds of tea-related questions, and take a look at that data.

tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
# NB: View() does not play well with markdown documents so we have commented it out...
# View(tea)

#...we can however take a quick peek of the data in the context of this markdown 
# document for example like this:
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

Above we have printed out the first few observations of the data as well as the variable names and types. In summary, there are 36 variables, all except age of type Factor with two to six levels. There are also 300 rows in the data, one observation per participant.

Next we shall visualize the data by plotting the distributions of answers to a handful of interesting questions as well as the distribution of participant ages.

# Create a new smaller dataset of some interesting variables
tea_2 <- select(tea, all_of(c("breakfast", "dinner", "evening", 
                              "Tea", "how", "sugar", 
                              "where", "How", "friends")))

# Shorten some names for clarity
levels(tea_2$how) <- c("bag", "bag+unpck.", "unpck.")
levels(tea_2$where) <- c("store", "store+shop", "shop")

# Plot the variables as well as the distribution of participant ages
pivot_longer(tea_2, cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free") + geom_bar()

hist(tea$age, col = "white", xlab = "Age", ylab = "", main = "")

Let us then run a Multiple Correspondence Analysis (MCA) on the new data with only the chosen variables.

# MCA
mca <- MCA(tea_2, graph = FALSE)

# Percentage of explained variance by dimensions
fviz_screeplot(mca, addlabels = TRUE, ylim = c(0, 20))

# Biplot
fviz_mca_biplot(mca, repel = TRUE, ggtheme = theme_minimal())

Above we have plotted a scree plot of the percentage of variance explained by each dimension and a biplot summarizing the results of the MCA. Evidently, we have not been hugely efficient in summarizing the data with fewer dimensions than in the original data, since the first two dimensions (of ten) explain only 12.29% and 12.01% of the total variance.

Accordingly, it is relatively difficult to tease out a clear interpretation of the first two dimensions from the biplot, apart from a few clear connections between the variables. For instance, people who prefer their tea unpackaged also tend to prefer tea shops over chain stores, which isn’t a great surprise.


Chapter 6

Let us start the sixth week’s assignment by loading the datasets to be used this week. Note that when we prepared the data, we saved it in RDS format (as opposed to, say, a csv file) so that all the features of the data are preserved (namely, categorical variables stay factors).
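The point about preserved features is easy to verify with a standalone sketch using temporary files (not part of the assignment data):

```r
# A data frame with a factor column
df <- data.frame(g = factor(c("a", "b", "a")))

# RDS round-trip: the factor survives
rds <- tempfile(fileext = ".rds")
saveRDS(df, rds)
class(readRDS(rds)$g)       # "factor"

# csv round-trip: the factor comes back as plain character
# (in R >= 4.0, where stringsAsFactors defaults to FALSE)
csv <- tempfile(fileext = ".csv")
write.csv(df, csv, row.names = FALSE)
class(read.csv(csv)$g)      # "character"
```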

BPRSL <- readRDS("data/BPRSL.rds")
RATSL <- readRDS("data/RATSL.rds")

RATS

In the first half of this week’s analysis we shall replicate the analysis from Chapter 8 of Multivariate Analysis for the Behavioral Sciences (MABS), but using the RATS data (which we have loaded in long format above). The data consists of repeated measurements of the weights of 16 individual rats, divided into 3 groups with different diets.

ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times = 4)) +
  facet_grid(. ~ Group, labeller = label_both) + 
  theme(legend.position = "none")

Evidently, the weights of the rats in group one are much lower to begin with than those in the other two groups. Also, the rats in the last group are heavier than those in the second group, apart from one specific rat in the second group that is the heaviest of them all and the only one whose weight differs greatly from the other rats in the same group. The figure above also clearly shows that, in general, the rats have gained weight over the course of the study, some slightly more than others.

The rats with a higher initial weight obviously tend to be heavier throughout the study (a phenomenon known as tracking), making the visual assessment of any treatment effects between the groups quite difficult. To this end, we may standardize the data before plotting. That is, in the figure below we have standardized the rats’ weights to have zero mean and unit standard deviation within each group.

# Standardise the weights within groups
RATSL <- RATSL %>%
  group_by(Group) %>%
  mutate(Weight_std = (Weight - mean(Weight))/sd(Weight) ) %>%
  ungroup()

# Replot, but with standardized weights
ggplot(RATSL, aes(x = Time, y = Weight_std, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times = 4)) +
  facet_grid(. ~ Group, labeller = label_both) + 
  theme(legend.position = "none")

According to the plot with standardized data, the treatment effect (the weight gained throughout the study) seems slightly smaller for the rats in the second group than for the other two groups. However, standardizing has also made it clear that there is actually one relatively distinct individual (outlier) in each group, which might affect the standardization.

Perhaps plotting the within-group means would help our visual assessment? As there is, however, a clear difference in the initial values between the groups, an even better idea is to plot the mean profiles of weight gained within groups with the associated standard errors.
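As a refresher before computing the group summaries: the standard error of a mean is the sample standard deviation divided by the square root of the sample size. A one-line check in base R (toy numbers):

```r
# Standard error of the mean: sd / sqrt(n)
x <- c(10, 12, 9, 11)
sd(x) / sqrt(length(x))  # 0.6455 (sd = 1.291, n = 4)
```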

# Add variable 'Weight_gain' (weight gained since the first measurement)
RATSL <- RATSL %>%
  group_by(ID) %>%
  mutate(Weight_gain = Weight - Weight[Time == 1]) %>%
  ungroup()

# Summary data with mean and standard error of Weight_gain by Group and Time
# (standard error = sd / sqrt(n))
RATSL2 <- RATSL %>%
  group_by(Group, Time) %>%
  summarise(mean_gain = mean(Weight_gain),
            se = sd(Weight_gain) / sqrt(length(Weight_gain))) %>%
  ungroup()

# Plot the weight gain mean profiles
ggplot(RATSL2, aes(x = Time, y = mean_gain, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1, 2, 3)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_gain - se, ymax = mean_gain + se, linetype = "1"), width = 0.3) +
  scale_shape_manual(values = c(1, 2, 3)) +
  theme(legend.position = c(0.2, 0.8)) +
  scale_y_continuous(name = "Weight Gain +/- SE, Grams") + 
  scale_x_continuous(name = "Time")

Finally, we are starting to get a clearer picture of what is going on between the groups. The rats in the second group have gained the most weight on average, while the rats from the first group have gained the least. However, we are looking at the actual weight gained in grams here and the relative weight gain in percentages might tell a different story, with respect to the first group at least. To that end let us quickly plot such a figure as well.

# Add variable 'Weight_rel_gain' (relative weight gained since the first measurement, %)
RATSL <- RATSL %>%
  group_by(ID) %>%
  mutate(Weight_rel_gain = 100 * (Weight - Weight[Time == 1]) / Weight[Time == 1]) %>%
  ungroup()

# Summary data with mean and standard error of Weight_rel_gain by Group and Time
RATSL3 <- RATSL %>%
  group_by(Group, Time) %>%
  summarise(mean_rel_gain = mean(Weight_rel_gain),
            se = sd(Weight_rel_gain) / sqrt(length(Weight_rel_gain))) %>%
  ungroup()

# Plot the weight gain mean profiles
ggplot(RATSL3, aes(x = Time, y = mean_rel_gain, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1, 2, 3)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_rel_gain - se, ymax = mean_rel_gain + se, linetype = "1"), width = 0.3) +
  scale_shape_manual(values = c(1, 2, 3)) +
  theme(legend.position = c(0.2, 0.8)) +
  scale_y_continuous(name = "Weight Gain +/- SE, %") + 
  scale_x_continuous(name = "Time")

Indeed, if we look at the relative weight gained, the eventual mean gains in the first and third groups hardly differ from each other anymore. We may treat the percentage-wise weight gain over the course of the study as our outcome of interest. To that end, let us take a closer look at the eventual relative weight gain by plotting a box plot of those values.

# Create another summary data for the boxplot
RATSL4 <- RATSL %>%
  filter(Time == max(Time)) %>%
  group_by(Group, ID) %>%
  summarise(mean = mean(Weight_rel_gain)) %>%
  ungroup()

# Draw a boxplot of the outcome of interest versus the group (or treatment)
ggplot(RATSL4, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape = 23, size = 4, fill = "white") +
  scale_y_continuous(name = "Average weight gained throughout the study, %")

The boxplot tells pretty much the same story as the previous plot. Although there is plenty of variability within the groups, the rats in the second group have clearly gained more weight than those in the other two groups. There is also more variability in the last group than in the first, but the average gain is more or less the same between those groups. The first group also seems to feature a bit of an outlier, with one rat gaining somewhat more weight than the other rats within that group. However, nothing suggests that this individual is not representative of the population of interest, so we should not drop it from the data (the observation is actually within two standard deviations of the group mean, hence not even much of an outlier).
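The two-standard-deviation check mentioned above is worth a quick illustration (toy numbers here; the actual group means live in RATSL4$mean). A value can look extreme in a boxplot yet still lie within two standard deviations of the mean, partly because the extreme value itself inflates the standard deviation:

```r
# Toy vector with one visually extreme value
x <- c(5, 6, 7, 8, 20)
z <- (x - mean(x)) / sd(x)   # standardized scores
round(max(abs(z)), 2)        # 1.76, i.e. even 20 is within 2 SDs here
```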

We may finally take a more formal approach to what our earlier visual assessments imply, that is, whether the second group has really gained more weight, by running pairwise t-tests between the groups.

groups <- list()
for(i in 1:3) groups[[i]] <- RATSL4$mean[which(RATSL4$Group == i)]
t.test(groups[[1]], groups[[2]])
## 
##  Welch Two Sample t-test
## 
## data:  groups[[1]] and groups[[2]]
## t = -1.4056, df = 4.0499, p-value = 0.2317
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -15.436270   5.026318
## sample estimates:
## mean of x mean of y 
##  9.363791 14.568768
t.test(groups[[1]], groups[[3]])
## 
##  Welch Two Sample t-test
## 
## data:  groups[[1]] and groups[[3]]
## t = 0.44946, df = 6.2398, p-value = 0.6683
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.738523  6.895430
## sample estimates:
## mean of x mean of y 
##  9.363791  8.285338
t.test(groups[[2]], groups[[3]])
## 
##  Welch Two Sample t-test
## 
## data:  groups[[2]] and groups[[3]]
## t = 1.5957, df = 4.7498, p-value = 0.1745
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.00114 16.56800
## sample estimates:
## mean of x mean of y 
## 14.568768  8.285338

It turns out that although the differences in means are apparent, they are not statistically significant at any commonly used significance level. The p-values for the tests of the second group against the two other groups are, however, clearly smaller than that of the test between the first and third groups, as expected. The lack of statistical significance is no surprise, since we have so few observations, with only 8, 4 and 4 individuals in the groups, respectively.
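With three pairwise comparisons, one might additionally adjust the p-values for multiple testing; base R's `pairwise.t.test` does this directly. A sketch on simulated stand-in data (the real outcome values live in RATSL4):

```r
set.seed(2)
# Simulated stand-in for the group-wise relative weight gains (8, 4 and 4 rats)
toy <- data.frame(
  gain  = c(rnorm(8, mean = 9, sd = 3), rnorm(4, mean = 15, sd = 3),
            rnorm(4, mean = 8, sd = 3)),
  group = factor(rep(1:3, times = c(8, 4, 4)))
)

# Welch-type pairwise t-tests with Bonferroni-adjusted p-values
pairwise.t.test(toy$gain, toy$group, p.adjust.method = "bonferroni",
                pool.sd = FALSE)
```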

BPRS

In the second half of this week’s analysis we shall replicate the analysis from Chapter 9 of Multivariate Analysis for the Behavioral Sciences (MABS), this time using the BPRS data (which we also loaded in long format at the beginning of this chapter). The data consists of repeated measurements of 40 male subjects who were randomly assigned to one of two treatment groups. Each subject was rated on the brief psychiatric rating scale (BPRS) once before treatment began and then weekly for eight weeks. The BPRS assesses different symptoms on a scale from one to seven and is used to evaluate patients suspected of having schizophrenia.

Let us begin the analysis by plotting the data.

ggplot(BPRSL, aes(x = weeks, y = bprs, linetype = subject)) +
  geom_line() +
  facet_grid(. ~ treatment, labeller = label_both) + 
  theme(legend.position = "none")

At first glance there seems to be a modest downward trend in BPRS for both treatments over the course of the study, suggesting the treatments to be somewhat successful. There seems, however, to be more dispersion, or heterogeneity in responses to the treatment, in the second group than in the first, suggesting that the first treatment perhaps has a more uniform effect.

Visual assessment, however, only gets us so far, and in what follows we shall take a more formal approach. We start with a naive linear regression model and ignore the serial correlation of the observations for now. We use BPRS as the dependent variable and both time (weeks) and the treatment group (binary) as explanatory variables.

# Linear regression model
BPRS_lm <- lm(bprs ~ weeks + treatment, data = BPRSL)

# Print out a summary of the model
summary(BPRS_lm)
## 
## Call:
## lm(formula = bprs ~ weeks + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## weeks        -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

The results of the linear regression model are summarized above. Evidently, there is a clearly statistically significant downward-sloping time trend, which can be interpreted as the BPRS being around 2.3 points lower every week on average. There is, however, no statistically significant difference in the overall level of BPRS between the two treatment groups. This is by no means an optimal model to use, though. First, the observations are by no means independent (although conditional on the time trend they could be, under further assumptions). Second, our interest lies in the difference in the treatment effect between the two groups, which we could interpret, for example, as different slope parameters for the two groups. Above, however, we have assumed the groups to share the slope parameter and differ only in intercept, which doesn’t seem ideal given the circumstances. Moreover, different subjects have different initial scores, suggesting that individuals might have different intercepts as well.

To address the latter point, let us next fit a random intercept model in which every subject has their own “random” intercept that follows a normal distribution.

# Random intercept model
BPRS_lme1 <- lmer(bprs ~ weeks + treatment + (1 | subject), data = BPRSL, REML = FALSE)

# Print out a summary of the model
summary(BPRS_lme1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ weeks + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## weeks        -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) weeks 
## weeks      -0.437       
## treatment2 -0.282  0.000

The point estimates for the parameters of interest stay the same in the random intercept model as in the naive linear regression model (as they should), but letting the intercepts vary between subjects might give us smaller parameter variances (if our random intercept assumption fits the data) and potentially statistically significant results. As one might expect, this is not the case here: the standard errors do indeed get smaller, but not by much, and the treatments still do not differ statistically significantly from one another.

Next, let us fit a model with both random intercept and random slope parameters. That is, in addition to the intercept, we let the slope parameter vary between individuals as well.

# Random intercept and random slope model
BPRS_lme2 <- lmer(bprs ~ weeks + treatment + (weeks | subject), data = BPRSL, REML = FALSE)

# Print out a summary of the model
summary(BPRS_lme2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ weeks + treatment + (weeks | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           weeks        0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## weeks        -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) weeks 
## weeks      -0.582       
## treatment2 -0.247  0.000

To our disappointment, the results do not change much: there are no significant differences between the results of our last two models. We might still be interested in knowing which one fits the data better, that is, which model's results we should report or continue working with. To this end we may take a look at the log-likelihoods of the models and the associated likelihood ratio test.

# ANOVA test on the two models
anova(BPRS_lme1, BPRS_lme2)
## Data: BPRSL
## Models:
## BPRS_lme1: bprs ~ weeks + treatment + (1 | subject)
## BPRS_lme2: bprs ~ weeks + treatment + (weeks | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_lme1    5 2748.7 2768.1 -1369.4   2738.7                       
## BPRS_lme2    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The latter model has a slightly higher log-likelihood, and the likelihood ratio test indeed suggests the difference to be statistically significant at the 5% level (but not at, say, the 1% level).
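As a sanity check, the chi-square statistic and p-value in the anova table can be reproduced by hand from the reported deviances:

```r
# Likelihood ratio statistic: difference of the two deviances
2738.7 - 2731.4                             # = 7.3 (7.2721 before rounding)

# p-value from the chi-squared distribution with 2 degrees of freedom
# (two extra parameters: the slope variance and its correlation with the intercept)
pchisq(7.2721, df = 2, lower.tail = FALSE)  # ≈ 0.0264, matching the table
```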

Finally we will fit a model that, in addition to everything else, allows for a time-by-group interaction. In practice, this accommodates our earlier hypothesis of different slopes between the groups.

# Random intercept and random slope model with time x group interactions
BPRS_lme3 <- lmer(bprs ~ weeks + treatment + (weeks | subject) + weeks * treatment, data = BPRSL, REML = FALSE)

# Print out a summary of the model
summary(BPRS_lme3)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ weeks + treatment + (weeks | subject) + weeks * treatment
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           weeks        0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                  Estimate Std. Error t value
## (Intercept)       47.8856     2.2521  21.262
## weeks             -2.6283     0.3589  -7.323
## treatment2        -2.2911     1.9090  -1.200
## weeks:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) weeks  trtmn2
## weeks       -0.650              
## treatment2  -0.424  0.469       
## wks:trtmnt2  0.356 -0.559 -0.840

Now, according to the model with the interaction, the slope parameter is indeed greater in absolute value for the first treatment group, which matches our earlier visual assessment. Still, the difference is only barely, if at all, statistically significant. Computing exact p-values for linear mixed models is not trivial, nor always recommended, but the t-score for the interaction term is 1.785, so for instance an approximate 95% confidence interval would include zero, while a 90% confidence interval would not. We thus conclude that our model provides tentative support in favor of the first treatment over the second. More importantly, the effect of both treatments differs statistically from zero, as indicated by the high absolute t-score for the slope.
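The confidence-interval claim above can be checked with a quick Wald approximation, using the estimate and standard error reported in the model summary (0.7158 and 0.4010):

```r
est <- 0.7158   # fixed-effect estimate for weeks:treatment2
se  <- 0.4010   # its standard error

# Approximate Wald confidence intervals: estimate +/- z * SE
est + c(-1, 1) * qnorm(0.975) * se  # 95% CI: roughly (-0.07, 1.50), includes 0
est + c(-1, 1) * qnorm(0.95)  * se  # 90% CI: roughly ( 0.06, 1.38), excludes 0
```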

Indeed, the likelihood ratio test (below) suggests our latter model fits the data only slightly better than the previous one, the difference being statistically significant only at the 10% level, but not at the commonly used 5% level. This further corroborates that we should be cautious with our conclusions, and that more data is most probably needed to draw anything but very tentative conclusions regarding the differences in success between the two treatments.

# ANOVA test on models with and without interactions
anova(BPRS_lme2, BPRS_lme3)
## Data: BPRSL
## Models:
## BPRS_lme2: bprs ~ weeks + treatment + (weeks | subject)
## BPRS_lme3: bprs ~ weeks + treatment + (weeks | subject) + weeks * treatment
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_lme2    7 2745.4 2772.6 -1365.7   2731.4                       
## BPRS_lme3    8 2744.3 2775.4 -1364.1   2728.3 3.1712  1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Last but not least, let us plot the fitted values of the last model against the observed data to get some idea of how well the model actually fits the data (in absolute terms, that is, not against some other model). Below the observed values are in red and the fitted values in blue. The model clearly provides a reasonable approximation for the first treatment group, but it does not account for heteroskedasticity across groups (that is, different variances in different groups). For this reason, the fitted values for the second treatment group seem too tightly packed, and some subjects fall well outside this pack of fitted values. To conclude, there is still room for improvement, but for now we leave our analysis here.

# Create a new data with fitted values of BPRSL instead of observed
BPRSL_fitted <- BPRSL
BPRSL_fitted$bprs <- predict(BPRS_lme3)

# Combine the data with observed and fitted values and add new column fitted
BPRSL$fitted <- 0
BPRSL_fitted$fitted <- 1
BPRSL_combined <- rbind(BPRSL, BPRSL_fitted)
BPRSL_combined$fitted <- as.factor(BPRSL_combined$fitted)

# Plot the observed values against the fitted values
ggplot(BPRSL_combined) +
  geom_line(aes(x = weeks, y = bprs, linetype = subject, col = fitted)) +
  facet_grid(. ~ treatment, labeller = label_both) + 
  theme(legend.position = "none")

I suppose this concludes the last assignment of this course. I hope you had a nice course, whoever you are, and I wish you happy holidays!